Google Analytics Categorization

Categorizing ADC site page URLs to more easily analyze user engagement.


Google Analytics Definitions

Page: The part of the URL after the domain name (the path) for content viewed on the website. For example, if someone views https://www.example.com/contact, then /contact is reported as the page in the Behavior reports.
User: An individual visitor to the site (tracked using browser cookies).
Sessions: A single visit to the website, consisting of one or more pageviews and any other interactions. The default session timeout is 30 minutes.
Users % of Total: Users displayed as a percentage of the Total Users during the report period.
Pageviews: The number of times users view a page that has the Google Analytics tracking code inserted. This covers all page views, so if a user refreshes the page, or navigates away from the page and returns, each of these is counted as an additional pageview.
Unique Pageviews: The number of sessions in which the page was viewed at least once. Whether a user viewed the page once or five times during a visit, it is counted as just one unique pageview.
Entrances: The number of visits that started on a specific web page or group of web pages, i.e. the first page that someone views during a session.
Bounce Rate: The percentage of visits in which users left the site after just one page view, regardless of how long they stayed on that page (total bounces divided by total visits).
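As a rough illustration of how these definitions relate to one another, the sketch below derives each metric from a small, made-up hit log. The `hits` data frame and its `session` and `page` columns are assumptions for illustration only; Google Analytics calculates these values itself.

```r
# Hypothetical hit log: one row per pageview, in the order the hits occurred
hits <- data.frame(
  session = c(1, 1, 1, 2, 3, 3),
  page    = c("/", "/catalog", "/", "/about", "/", "/"),
  stringsAsFactors = FALSE
)

# Pageviews: every hit counts, including repeat views within a session
pageviews <- table(hits$page)

# Unique Pageviews: a page counts at most once per session
unique_pageviews <- table(unique(hits)$page)

# Entrances: the first page viewed in each session
first_hits <- hits[!duplicated(hits$session), ]
entrances <- table(first_hits$page)

# Bounce Rate: sessions with a single pageview divided by all sessions
hits_per_session <- table(hits$session)
bounce_rate <- sum(hits_per_session == 1) / length(hits_per_session)
```

In this toy log the page "/" has 4 pageviews but only 2 unique pageviews, and one of the three sessions is a bounce.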



Categorization Function

We will use the code and function below to categorize the Google Analytics dataset. The function takes messy character data within a data frame and categorizes it based on a set of search strings. The inputs are the data frame, the name of the column holding the messy data, a list of search strings, and a list of category names (the two lists must line up element by element). Optionally, you can name the new column.

The order of the search strings matters when one string is contained in another. Each row is assigned to the first search string it matches, so “catalog” listed before “catalog/submit” would claim every catalog/submit page; you must list the longer string first (i.e. catalog/submit). Additionally, make sure the order of the categories list matches the order of the search strings.
Source: https://github.com/lenwood
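The effect of search-string order can be seen in a minimal sketch. The helper `first_match` below is hypothetical and is not the categorization function used in this report; it only mimics the "first matching string wins" behaviour on a plain character vector.

```r
# Hypothetical helper: assign each page the category of the first search string it matches
first_match <- function(pages, searchList, catList) {
  out <- rep(NA_character_, length(pages))
  for (i in seq_along(searchList)) {
    # only rows that are still uncategorized can be claimed
    hit <- is.na(out) & grepl(searchList[i], pages, ignore.case = TRUE)
    out[hit] <- catList[i]
  }
  out[is.na(out)] <- "OTHER"
  out
}

pages <- c("catalog", "catalogsubmit", "catalogprofile")

# Short string first: "catalog" claims every page that contains it
short_first <- first_match(pages,
                           c("catalog", "catalogsubmit", "catalogprofile"),
                           c("Cathome", "Submit", "Summary"))
# short_first is "Cathome" "Cathome" "Cathome"

# Longer strings first: each page lands in its intended category
long_first <- first_match(pages,
                          c("catalogsubmit", "catalogprofile", "catalog"),
                          c("Submit", "Summary", "Cathome"))
# long_first is "Cathome" "Submit" "Summary"
```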



Identify Search Strings and Category Names

# List of search strings -- note that the longer search strings are identified first 
search <- c("news", "portals", "about","catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")

# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")
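Because the two lists are paired by position, it can help to check them before categorizing. The sketch below is one possible sanity check, not part of the original workflow: it confirms the lists have the same length, lays them side by side, and flags any search string that contains an earlier string in the list (such a string could never be reached). The names `lookup` and `shadowed` are introduced here for illustration.

```r
search <- c("news", "portals", "about", "catalogprofile", "catalogsubmit", "catalog",
            "training", "team", "home", "view", "submit", "profile")
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome",
                "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")

# The lists must pair up one-to-one
stopifnot(length(search) == length(categories))

# Side-by-side view of which string maps to which category
lookup <- data.frame(search = search, category = categories, stringsAsFactors = FALSE)

# TRUE where an earlier search string is contained in this one
shadowed <- vapply(seq_along(search), function(i) {
  earlier <- search[seq_len(i - 1)]
  any(vapply(earlier, function(s) grepl(s, search[i], fixed = TRUE), logical(1)))
}, logical(1))

# Strings that would never receive any rows (empty for the lists above)
search[shadowed]
```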


Create Categorization Function

# Quickly categorize a data frame with a column of messy character strings. 

# Replace "df" with your messy dataframe.

categorizeDF <- function(df, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(df), nrow=0))
  colnames(catDF) <- paste0(names(df))

  # add sequence so original order can be restored
  df$sequence <- seq(nrow(df))

  # iterate through the strings
  for (i in seq_along(searchList)) {
    rownames(df) <- NULL
    index <- grep(searchList[i], df[[searchColName]], ignore.case=TRUE)
    # skip search strings with no matches: df[-integer(0),] would drop every remaining row
    if (length(index) > 0) {
      tempDF <- df[index,]
      tempDF$newCol <- catList[i]
      catDF <- rbind(catDF, tempDF)
      df <- df[-index,]
    }
  }

  # OTHER category for unmatched rows
  if (nrow(df) > 0) {
    df$newCol <- "OTHER"
    catDF <- rbind(catDF, df)
  }

  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL

  # remove row names
  rownames(catDF) <- NULL

  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)

  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}
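One edge case is worth keeping in mind when adapting a loop that removes matched rows with a negative index: `grep()` returns `integer(0)` when nothing matches, and indexing with `-integer(0)` selects no rows at all rather than all of them. The toy data frame below is an assumption for illustration.

```r
df <- data.frame(Page = c("home", "news", "about"), stringsAsFactors = FALSE)

# No page contains "catalog", so grep() returns integer(0)
index <- grep("catalog", df$Page, ignore.case = TRUE)

# Unguarded removal: every row is lost
unguarded <- df[-index, , drop = FALSE]

# Guarded removal: only drop rows when there is something to drop
guarded <- if (length(index) > 0) df[-index, , drop = FALSE] else df

nrow(unguarded)  # 0
nrow(guarded)    # 3
```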


Call Function and Categorize Data

# Replace "df" with messy dataframe

# Identify which column you want to categorize -- in our case with Google Analytics, we will be categorizing the "Page" column that contains messy URL strings. Additionally, you can name the new column that contains the categories (e.g. "Category").

sorted <- categorizeDF(df, "column name with messy data", search, categories, "new category column name")



Test Run of Categorization with Small Subset of Data

###### TEST DATASET ######


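# Remove slashes and other punctuation (including hyphens and periods). Not sure if this is necessary. Am trying to differentiate the single "/" as the ADC Homepage, and make it easier to identify search terms for the function below.
# NOTE: mutate_all() applies gsub() to every column, not just Page, so decimal points are also stripped from the numeric columns (visible in the output table below). To clean only Page, use mutate(Page = gsub("[[:punct:]]", "", Page)) instead.
test_users_clean <- top_30_users %>%
  mutate_all(~ gsub("[[:punct:]]", "", .))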


# Rename the home page as "home" in the data frame. NOTE: for this particular dataset the home page is the top viewed page, so it is in row [1]. If it is not the top viewed page, you will need to determine which row holds the homepage and put that row number in the brackets. Is there a better way to do this?

test_users_clean$Page[1] <- "home"


### Categorize the page URLS in the Page column into larger categories using a function ###

## Create a list of search strings to sort through pages and a list of categories (the two lists must line up element by element). Order matters when one search string contains another: list the longer string first (e.g. "catalogsubmit" before "catalog"), otherwise the shorter string claims those pages.

# List of search strings
search <- c("news", "portals", "about","catalogprofile", "catalogsubmit", "catalog", "training", "team", "home", "view", "submit", "profile")

# List of categories
categories <- c("News", "Portals", "About", "Summary", "Submit", "Cathome", "Training", "Team", "Home", "Dataset", "WhoMustSub", "Summary")



## Create function [below] to categorize the messy "Page" column of the raw data frame. 
# This function looks at a data frame column of messy character (or factor) data and produces a new column of categorized data. The inputs are the data frame, the column name of the messy data, a list of search strings, a list of category names (these two have to line up element by element), and you have the option of naming the new column.


# Function:
categorizeDF <- function(test_users_clean, searchColName, searchList, catList, newColName="Category") {
  # create empty data frame to hold categories
  catDF <- data.frame(matrix(ncol=ncol(test_users_clean), nrow=0))
  colnames(catDF) <- paste0(names(test_users_clean))

  # add sequence so original order can be restored
  test_users_clean$sequence <- seq(nrow(test_users_clean))

  # iterate through the strings
  for (i in seq_along(searchList)) {
    rownames(test_users_clean) <- NULL
    index <- grep(searchList[i], test_users_clean[[searchColName]], ignore.case=TRUE)
    # skip search strings with no matches: negative indexing with integer(0) would drop every remaining row
    if (length(index) > 0) {
      tempDF <- test_users_clean[index,]
      tempDF$newCol <- catList[i]
      catDF <- rbind(catDF, tempDF)
      test_users_clean <- test_users_clean[-index,]
    }
  }

  # OTHER category for unmatched rows
  if (nrow(test_users_clean) > 0) {
    test_users_clean$newCol <- "OTHER"
    catDF <- rbind(catDF, test_users_clean)
  }

  # return to the original order & remove the sequence data
  catDF <- catDF[order(catDF$sequence),]
  catDF$sequence <- NULL

  # remove row names
  rownames(catDF) <- NULL

  # set Category type to factor
  catDF$newCol <- as.factor(catDF$newCol)

  # rename the new column
  colnames(catDF)[which(colnames(catDF) == "newCol")] <- newColName
  catDF
}


# Call the function and create new data frame - using the raw data frame, the messy column you want to sort, the search and category lists, and name of the new column

sortedDF <- categorizeDF(test_users_clean, "Page", search, categories, "Category")


knitr::kable(sortedDF, format = "html")
| Page | Users | Sessions | Users_._of_Total | Pageviews | Unique_Pageviews | Entrances | Bounce_Rate | Category |
|---|---|---|---|---|---|---|---|---|
| home | 25436 | 42464 | 0440145 | 75951 | 55359 | 42443 | 0421957423 | Home |
| catalog | 4310 | 2280 | 0257363 | 12070 | 8734 | 2131 | 0247368421 | Cathome |
| catalog | 4130 | 416 | 0195397 | 33 | 26 | 19 | 0033653846 | Cathome |
| data | 3291 | 2306 | 0160785 | 19380 | 9507 | 2319 | 0273200347 | OTHER |
| catalogdata | 3114 | 3395 | 0139405 | 14923 | 7964 | 3298 | 0253608247 | Cathome |
| about | 2637 | 941 | 0123776 | 3942 | 3297 | 944 | 0582359192 | About |
| team | 1634 | 614 | 0110133 | 2554 | 2117 | 615 | 0684039088 | Team |
| submit | 1384 | 898 | 009936 | 3255 | 2395 | 901 | 0643652561 | WhoMustSub |
| page0 | 1174 | 1580 | 0090577 | 5362 | 3813 | 1582 | 0158860759 | OTHER |
| training | 1166 | 892 | 0083537 | 2245 | 1639 | 892 | 0515695067 | Training |
| publications | 1120 | 431 | 0077705 | 1701 | 1326 | 432 | 0744779582 | OTHER |
| share | 1060 | 528 | 0072758 | 4532 | 2706 | 529 | 0357954545 | OTHER |
| profile | 989 | 232 | 0068477 | 1609 | 1338 | 238 | 061637931 | Summary |
| qanda | 912 | 214 | 0064713 | 1430 | 1181 | 214 | 0570093458 | OTHER |
| january2019datasciencetrainingforarcticresearchers | 903 | 1004 | 0061441 | 1359 | 1191 | 1004 | 0815737052 | Training |
| datapage0 | 873 | 556 | 0058545 | 4704 | 2254 | 557 | 0303956835 | OTHER |
| catalogprofile | 799 | 193 | 0055914 | 1376 | 1121 | 188 | 0564766839 | Summary |
| proposals | 773 | 660 | 0053551 | 1187 | 1008 | 661 | 0762121212 | OTHER |
| homehtm | 735 | 800 | 0051402 | 936 | 817 | 800 | 056875 | Home |
| support | 729 | 121 | 0049463 | 1308 | 1004 | 122 | 058677686 | OTHER |
| dataplans | 685 | 384 | 0047672 | 932 | 827 | 384 | 0841145833 | OTHER |
| 2018datasciencetrainingforarcticresearchers | 649 | 639 | 0046015 | 982 | 876 | 639 | 0723004695 | Training |
| news201606datascienceopportunities | 629 | 371 | 0044488 | 810 | 733 | 371 | 0851752022 | News |
| upcomingdatasciencetrainingforarcticresearchers | 599 | 612 | 0043066 | 857 | 767 | 612 | 0823529412 | Training |
| catalogsubmit | 582 | 302 | 0041746 | 1657 | 1196 | 304 | 0523178808 | Submit |
| catalogportalspermafrost | 548 | 651 | 0040505 | 874 | 722 | 650 | 0769585253 | Portals |
| reconcilinghistoricalandcontemporarytrendsinterrestrialcarbonexchangeofthenorthernpermafrostzone | 546 | 844 | 0039355 | 1189 | 995 | 844 | 0808056872 | OTHER |
| viewdoi103334CDIAC00001V2017 | 522 | 562 | 0038272 | 769 | 619 | 562 | 0807829181 | Dataset |
| catalogshare | 521 | 399 | 0037263 | 1197 | 994 | 378 | 0483709273 | Cathome |
| categorynews | 512 | 197 | 0036317 | 822 | 672 | 197 | 0624365482 | News |
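On the open questions raised in the test run (whether punctuation needs to be stripped everywhere, and how to find the home page without hard-coding its row), one possible approach is sketched below in base R. It flags the home page by its value ("/") before cleaning and strips punctuation from the Page column only, so numeric columns keep their decimal points. The small `top_30_users` data frame and its values are made up for illustration.

```r
# Hypothetical stand-in for top_30_users; the home page is deliberately not in row 1
top_30_users <- data.frame(
  Page        = c("/catalog/", "/", "/about/"),
  Bounce_Rate = c(0.25, 0.42, 0.58),
  stringsAsFactors = FALSE
)

test_users_clean <- top_30_users

# Flag the home page by value before the slash is stripped (independent of row order)
is_home <- test_users_clean$Page == "/"

# Strip punctuation from the Page column only
test_users_clean$Page <- gsub("[[:punct:]]", "", test_users_clean$Page)

# Label the flagged row(s) as "home"
test_users_clean$Page[is_home] <- "home"

test_users_clean
```

With this approach the numeric columns are left untouched, and the result does not depend on the home page being the most viewed page.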





Visualizations for User Analysis


Total Users Over Time

Remember that Users are all individual visitors to the site tracked by browser cookies. If a User visits the site multiple times with the same browser, they will not be counted twice.


##  2016  2017  2018  2019  2020 
## 23772 32205 48337 31187 60330
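Yearly totals of this kind can be produced by summing a Users column within each year. The sketch below is an assumption about how such a summary might be computed, not the code behind the figures above; the `annual` data frame and its columns are hypothetical. Note that if the rows are individual pages, a visitor who viewed several pages is counted once per page, so the sum can exceed the number of distinct users.

```r
# Hypothetical page-level data with one row per page and year
annual <- data.frame(
  Year     = c(2016, 2016, 2017, 2017),
  Category = c("Home", "News", "Home", "News"),
  Users    = c(100, 50, 120, 80),
  stringsAsFactors = FALSE
)

# Sum Users within each year
users_by_year <- tapply(annual$Users, annual$Year, sum)
users_by_year
```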


Tree Map for Total Users by Category and Year


Tree Map for Total Users 2016-2020



Circular Bar Plot for Top 100 Users per Category in 2016




Circular Proportion Graph for Total Users by Category

WORK IN PROGRESS

# Create circular graph that shows proportion of users within each category (34 total categories)

library(circlize)

circos.clear() 


category = annual_sortedDF$Category
# placeholder: 34 random values until the real percentages are computed
percent = sort(sample(40:80, 34))
color = rev(rainbow(length(percent)))


circos.par("start.degree" = 90, cell.padding = c(0, 0, 0, 0),
           canvas.xlim=c(-1.2, 1.2),   # bigger canvas?
           canvas.ylim=c(-1.2, 1.2)) 
circos.initialize("a", xlim = c(0, 100)) # 'a' just means there is one sector
circos.track(ylim = c(1, length(percent)+1), track.height = 0.9, 
    bg.border = NA, panel.fun = function(x, y) {
        xlim = CELL_META$xlim
        circos.segments(rep(xlim[1], 34), 1:34,
                        rep(xlim[2], 34), 1:34,
                        col = "#CCCCCC")
        circos.rect(rep(0, 34), 1:34 - 0.45, percent, 1:34 + 0.45,
            col = color, border = "white")
        circos.text(rep(xlim[1], 34), 1:34, 
            paste(category, " - ", percent, "%"), 
            facing = "downward", adj = c(1.05, 0.5), cex = 0.8) 
        breaks = seq(0, 85, by = 5)
        circos.axis(h = "top", major.at = breaks, labels = paste0(breaks, "%"), 
            labels.cex = 0.6)
})
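Since the `percent` vector in the work-in-progress code is filled with random numbers, one way to move forward would be to compute each category's actual share of users. The sketch below assumes `annual_sortedDF` has `Category` and `Users` columns; the values shown are invented, and the real data frame may be structured differently.

```r
# Hypothetical stand-in for annual_sortedDF
annual_sortedDF <- data.frame(
  Category = c("Home", "Home", "News", "About"),
  Users    = c(60, 20, 15, 5),
  stringsAsFactors = FALSE
)

# Total users per category
users_by_cat <- tapply(annual_sortedDF$Users, annual_sortedDF$Category, sum)

# Share of all users, as a percentage, sorted ascending for plotting
percent <- sort(round(100 * users_by_cat / sum(users_by_cat), 1))

# Category labels in the same order as the percentages
category <- names(percent)
```

The resulting `percent` and `category` vectors have one entry per category and stay aligned with each other.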



Most Used Search Terms in Data Catalog



Top 10 Datasets Per Year

Top 10 Viewed Datasets Per Year
| Dataset | Users | Pageviews | Year |
|---|---|---|---|
| Collaborative Research: Toward a Circumarctic Lakes Observation Network (CALON)– Multiscale observations of lacustrine systems. | 60 | 175 | 2016 |
| MAR v3.2 regional climate model data for Greenland (1958-2013). | 36 | 65 | 2016 |
| USGS Permafrost Temperatures Acquired from the DOI/GTN-P Deep Borehole Array in Arctic Alaska, 1973-continuing. | 35 | 56 | 2016 |
| The State of the Arctic Sea Ice Cover: Sustaining the integrated seasonal ice zone observing network. | 32 | 133 | 2016 |
| Collaborative Research: Sensitivity of Circum-Arctic Peatland Carbon to Holocene Warm Climates and Climate Seasonality. | 32 | 131 | 2016 |
| Nansen and Amundsen Basins Observational System. | 32 | 91 | 2016 |
| Arctic Shorebird Demographics Network. | 29 | 97 | 2016 |
| Passive acoustic data from A2 in the Bering Strait. | 28 | 66 | 2016 |
| Arctic Great Rivers Observatory. | 27 | 91 | 2016 |
| CTD and Mooring Data from the Eastern Eurasian and Makarov Basins, and Northern Laptev and East Siberian Seas from 2013-2015. | 26 | 50 | 2016 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset | 55 | 71 | 2017 |
| Collaborative Research: Toward a Circumarctic Lakes Observation Network (CALON)– Multiscale observations of lacustrine systems. | 42 | 162 | 2017 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset. | 41 | 51 | 2017 |
| Nordenskioldland, Svalbard Reindeer Carbon and Nitrogen Isotope Data. | 35 | 90 | 2017 |
| USGS Permafrost Temperatures Acquired from the DOI/GTN-P Deep Borehole Array in Arctic Alaska, 1973-continuing. | 35 | 64 | 2017 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) accumulation on land ice subdataset. | 34 | 58 | 2017 |
| Arctic Shorebird Demographics Network. | 33 | 75 | 2017 |
| Integrating Passive Acoustic Monitoring in long-term oceanographic observations of the Bering Strait. | 32 | 143 | 2017 |
| Automated ice mass balance site (SIZONET). | 30 | 73 | 2017 |
| NABOS - Water Quality and Physical Oceanography Data from the Eastern Eurasian and Makarov Basins, and Northern Laptev and East Siberian Seas in 2013. | 27 | 48 | 2017 |
| Photogrammetric scans of aerial photographs of North American glaciers. | 61 | 213 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset, Antarctica and Baltic Sea, 1990-2018. | 51 | 85 | 2018 |
| Soil bacterial community and functional shifts in response to altered snow pack in moist acidic tundra of Northern Alaska. | 50 | 131 | 2018 |
| A synthesis dataset of near-surface permafrost conditions for Alaska, 1997-2016. | 50 | 82 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 49 | 86 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset. | 44 | 62 | 2018 |
| Arctic Shorebird Demographics Network. | 43 | 82 | 2018 |
| Modèle Atmosphérique Régional (MAR) three-dimensional regional climate model (RCM), version 3.2, over Greenland, 1948-2016. | 41 | 71 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) accumulation on land ice subdataset, Greenland and Antarctica, 1987-2018. | 36 | 63 | 2018 |
| Floating and bedfast lake ice regimes across Arctic Alaska using space-borne SAR imagery from 1992-2016. | 35 | 98 | 2018 |
| Estimating the Freshwater Flux from the Greenland Ice Sheet Workshop Report, American Geophysical Union, 2018. | 404 | 533 | 2019 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 97 | 145 | 2019 |
| Pacific arctic sea-ice observations from U.S. Federal logbooks (1900-1938). | 82 | 127 | 2019 |
| North Pole Environmental Observatory Bottle Chemistry. | 75 | 147 | 2019 |
| Arctic Shorebird Demographics Network. | 54 | 100 | 2019 |
| Arctic Great Rivers Observatory III Biogeochemistry and Discharge Data, 2017-2019. | 47 | 92 | 2019 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 45 | 65 | 2019 |
| A synthesis dataset of near-surface permafrost conditions for Alaska, 1997-2016. | 45 | 53 | 2019 |
| Arctic Shorebird Demographics Network. | 43 | 46 | 2019 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) accumulation on land ice subdataset, Greenland and Antarctica, 1987-2018. | 43 | 52 | 2019 |
| Arc5km2018: Arctic Ocean Inverse Tide Model on a 5 kilometer grid, 2018. | 114 | 202 | 2020 |
| North Pole Environmental Observatory Bottle Chemistry. | 84 | 132 | 2020 |
| Global Seasonal Snow Classification System. | 71 | 133 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 68 | 124 | 2020 |
| AOTIM5: Arctic Ocean Inverse Tide Model, on 5 kilometer grid, developed in 2004. | 65 | 188 | 2020 |
| Arctic Tidal Current Atlas from Moored Current Observations, Arctic Ocean, 1998-2018. | 53 | 105 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) accumulation on land ice subdataset, Greenland and Antarctica, 1987-2018. | 51 | 70 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset, Antarctica and Baltic Sea, 1990-2018. | 50 | 59 | 2020 |
| Pacific arctic sea-ice observations from U.S. Federal logbooks (1900-1938). | 46 | 54 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 45 | 79 | 2020 |
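A per-year ranking like the one above can be built in base R by ordering the rows by year and descending users, then keeping the first N rows within each year. The sketch below is one way to do this under stated assumptions: the `datasets` data frame, its column names, and its values are hypothetical, and N is set to 2 to keep the example short.

```r
# Hypothetical dataset-level data
datasets <- data.frame(
  Dataset = c("A", "B", "C", "D", "E", "F"),
  Users   = c(36, 60, 35, 41, 55, 42),
  Year    = c(2016, 2016, 2016, 2017, 2017, 2017),
  stringsAsFactors = FALSE
)

# Order by year, then by users from highest to lowest
ordered <- datasets[order(datasets$Year, -datasets$Users), ]

# Position of each row within its year (1 = most users)
rank_in_year <- ave(seq_along(ordered$Year), ordered$Year, FUN = seq_along)

# Keep the top N rows per year
top_n_per_year <- ordered[rank_in_year <= 2, ]
rownames(top_n_per_year) <- NULL
top_n_per_year
```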



Top 20 Datasets of all Time

Top 20 Viewed Datasets of All Time
| Dataset | Users | Pageviews | Year |
|---|---|---|---|
| Estimating the Freshwater Flux from the Greenland Ice Sheet Workshop Report, American Geophysical Union, 2018. | 404 | 533 | 2019 |
| Arc5km2018: Arctic Ocean Inverse Tide Model on a 5 kilometer grid, 2018. | 114 | 202 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 97 | 145 | 2019 |
| North Pole Environmental Observatory Bottle Chemistry. | 84 | 132 | 2020 |
| Pacific arctic sea-ice observations from U.S. Federal logbooks (1900-1938). | 82 | 127 | 2019 |
| North Pole Environmental Observatory Bottle Chemistry. | 75 | 147 | 2019 |
| Global Seasonal Snow Classification System. | 71 | 133 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 68 | 124 | 2020 |
| AOTIM5: Arctic Ocean Inverse Tide Model, on 5 kilometer grid, developed in 2004. | 65 | 188 | 2020 |
| Photogrammetric scans of aerial photographs of North American glaciers. | 61 | 213 | 2018 |
| Collaborative Research: Toward a Circumarctic Lakes Observation Network (CALON)– Multiscale observations of lacustrine systems. | 60 | 175 | 2016 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset | 55 | 71 | 2017 |
| Arctic Shorebird Demographics Network. | 54 | 100 | 2019 |
| Arctic Tidal Current Atlas from Moored Current Observations, Arctic Ocean, 1998-2018. | 53 | 105 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset, Antarctica and Baltic Sea, 1990-2018. | 51 | 85 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) accumulation on land ice subdataset, Greenland and Antarctica, 1987-2018. | 51 | 70 | 2020 |
| Soil bacterial community and functional shifts in response to altered snow pack in moist acidic tundra of Northern Alaska. | 50 | 131 | 2018 |
| A synthesis dataset of near-surface permafrost conditions for Alaska, 1997-2016. | 50 | 82 | 2018 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow depth on sea ice subdataset, Antarctica and Baltic Sea, 1990-2018. | 50 | 59 | 2020 |
| Surface Mass Balance and Snow Depth on Sea Ice Working Group (SUMup) snow density subdataset, Greenland and Antartica, 1950-2018. | 49 | 86 | 2018 |